Let’s load the dataset and look at its structure and variables.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
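The loading and inspection step could look roughly like the sketch below. The file path and the `wines` name are assumptions, since the report does not show them; to keep the sketch runnable on its own, it writes the first three rows from the `str()` output (a few columns only) to a temporary CSV first.

```r
# Hypothetical stand-in for the real file: the first three rows shown in the
# str() output, written to a temporary CSV so the sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,alcohol,quality",
  "1,7.0,0.27,0.36,20.7,8.8,6",
  "2,6.3,0.30,0.34,1.6,9.5,6",
  "3,8.1,0.28,0.40,6.9,10.1,6"
), csv_path)

# In the real analysis, csv_path would point at the wine dataset instead.
wines <- read.csv(csv_path)

summary(wines)  # per-column minimum, quartiles, mean and maximum
str(wines)      # column types and the first few values
```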
It seems all of the variables are numerical. Even quality is numerical, but it could just as easily be categorical.
Let’s first explore the distribution of quality, as it is our dependent variable in this research.
The qualities of wine seem to be somewhat normally distributed around the median of 6. The tail is slightly heavier on the lower-quality side, with quality-5 wines being by far the second most numerous after quality 6. It also seems that no wines were given a 10 or a 0-2. Additionally, only 5 wines were of quality 9. The vast majority of wines have a quality of either 5 or 6.
Since there are so few wines with a quality under 4 or over 8, we will subset the dataset to exclude them.
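A minimal sketch of this step, using a handful of made-up quality scores rather than the real data. `wines_subset` and `quality_factor` are the names that appear in the outputs later in the report; `wines` is an assumed name for the unfiltered data.

```r
# Toy quality scores (assumed data, not the real dataset).
wines <- data.frame(quality = c(3L, 4L, 5L, 5L, 6L, 6L, 6L, 7L, 8L, 9L))

# Count how many wines received each score.
table(wines$quality)

# Keep only qualities 4 through 8, dropping the sparse extremes.
wines_subset <- subset(wines, quality >= 4 & quality <= 8)

# Ordered factor version of quality, for use as a categorical axis.
wines_subset$quality_factor <- factor(wines_subset$quality, ordered = TRUE)
```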
To examine the distributions of the independent variables and discover possible outliers, let’s also plot histograms of the independent variables.
Fixed acidity seems very normally distributed, with most values falling between around 4.4 and 9.6. Let’s look at the quantiles of the feature.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.851 7.300 14.200
Despite the data being quite normally distributed, there seem to be at least a few high outliers.
Looking at the distribution, let’s cut the outliers by keeping only values less than 10. Values higher than that seem to be outliers.
Volatile acidity seems to have a bit of a long tail on the increasing side of the values. Let’s look at the quantiles of volatile acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2779 0.3200 1.1000
The fourth quartile of volatile acidity lies inside the long tail, between 0.32 and 1.1. There seem to be some very high outliers at the end of the long tail.
Let’s cut out values over 0.70.
Citric acid also seems to be quite normally distributed (albeit with a long tail), with a few peculiar exceptions that can distort the interpretation of the data. It seems to have large spikes in frequency at 0.5 and 0.75. The one at 0.5 is especially curious, as it almost rivals the most frequent citric.acid levels at around 0.3. This may be because producers of more acidic wines opt for this exact amount of citric acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The quantiles look fairly symmetric around the median, with the exception of a few outliers in the fourth quartile. The spikes are only really noticeable in the visualization.
To exclude outliers, let’s keep only values less than or equal to 0.75 (to make sure the high citric.acid spike is included).
Residual sugar seems very long-tailed, with most wines not having much sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.379 9.900 31.600
The mean (6.379) is noticeably higher than the median (5.200), and the upper quartiles span a much wider range than the first one, which confirms the long tail seen in the visualization. There are also a few very large outliers.
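The mean-versus-median check used here can be sketched as follows. The data are simulated (an exponential sample standing in for residual.sugar), so the numbers are illustrative only.

```r
set.seed(1)
# Assumed toy data: a right-skewed variable standing in for residual.sugar.
sugar <- rexp(1000, rate = 1 / 6)

summary(sugar)

# For a right-skewed variable the mean sits above the median ...
mean_above_median <- mean(sugar) > median(sugar)

# ... and the distance from the median to the maximum is large compared
# with the interquartile range.
upper_tail_ratio <- (max(sugar) - median(sugar)) / IQR(sugar)

mean_above_median
upper_tail_ratio
```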
To cut them out, let’s keep only values less than 20.
Chlorides seem very normally distributed, with a few outliers at the end of a long tail. Overall, the variance of the values seems quite low.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04557 0.05000 0.34600
The quantiles seem very even, although there are a few large values.
Let’s cut the large values by keeping only values under 0.10.
Free sulfur dioxide is quite normally distributed except the right side of the curve seems slightly less steep. There are a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.12 46.00 138.50
The quantiles tell the same story as the visualization. The third quartile is 46, so the span from the median to the third quartile (12) is slightly longer than the span from the first quartile to the median (11), and the maximum value of 138.50 indicates some outliers.
Let’s cut those out by keeping only values under 80.
Total sulfur dioxide also seems quite normally distributed, with a slightly less steep curve on the right side. The variance in values seems quite high, since the tails are not that steep.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 133.0 137.3 166.0 344.0
There also seem to be a few high values, as indicated by the maximum value.
Let’s cut them out by keeping only values under 270.
Density values seem to have very little variance; almost all the values fall between 0.990 and 1.000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9960 1.0020
The fourth quartile seems to have a small tail, but overall there seem to be no real outliers. Therefore there is no real need to subset density.
Similar to density, the pH values have little variation, with most of the values falling between 3 and 3.5. The distribution looks to be very normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.090 3.180 3.191 3.280 3.820
The quantiles support what the visualization shows: there do not seem to be any real outliers in the values. Therefore subsetting pH is unnecessary.
Sulphates seem to be a bit long-tailed on the right side and the values have several peaks between 0.4 and 0.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.22 0.41 0.48 0.49 0.55 1.08
The quantiles also indicate a less steep curve across the third and fourth quartiles, and there are a few outliers at the end of the fourth quartile. The quantiles do not explain the peaks in the values, which can only really be seen by visualizing the distribution.
Let’s cut the outliers by keeping only values under 0.9.
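Taken together, the cutoffs chosen so far can be written as one filter, sketched below on four made-up rows. The report applies the cuts one variable at a time; combining them here is a simplification that keeps the same rows in the end. `wines_clean` is a hypothetical name used only for this illustration.

```r
# Toy rows (assumed data); the cutoffs are the ones stated in the text.
wines_subset <- data.frame(
  fixed.acidity        = c(7.0, 14.2, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.30, 1.10, 0.28),
  citric.acid          = c(0.36, 0.34, 0.40, 0.75),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5),
  chlorides            = c(0.045, 0.049, 0.050, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186),
  sulphates            = c(0.45, 0.49, 0.44, 0.40)
)

wines_clean <- subset(
  wines_subset,
  fixed.acidity < 10 &
    volatile.acidity <= 0.70 &
    citric.acid <= 0.75 &          # inclusive, so the spike at 0.75 is kept
    residual.sugar < 20 &
    chlorides < 0.10 &
    free.sulfur.dioxide < 80 &
    total.sulfur.dioxide < 270 &
    sulphates < 0.9
)

nrow(wines_clean)
```

In the toy data, the first three rows each violate one cutoff (residual sugar, fixed acidity and volatile acidity respectively), so only the last row survives.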
Alcohol seems to have most of its values at the low end of the range, with higher alcohol amounts being less frequent. The decline is quite linear, with few outliers. The variance also seems to be quite high.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.40 10.54 11.40 14.20
After running ggpairs to view the relationships between features, the plots and the correlation values indicate a poor correlation between the independent variables and wine quality. This seems understandable, as it is difficult to imagine linear relationships between wine quality and, for example, the salt, sugar, or alcohol content of the wine, or its acidity.
The ggpairs output is not shown here, as it renders poorly in the knitted HTML. To get a clearer view of the variables’ relationships with wine quality, I will plot them as boxplots. Boxplots are a good way to examine how the values of an independent variable vary across our dependent variable, which can be interpreted as a categorical variable.
In the boxplots we visualize some promising possible independent variables vs wine quality.
Looking at the ggpairs output, fixed.acidity, volatile.acidity, residual.sugar, total.sulfur.dioxide, citric.acid, density and alcohol seemed like the most promising independent variables to affect wine quality.
To get a better estimate of the variance in the independent variable values, let’s create boxplots of the relationships between the independent variables and wine quality.
Looking at the relationships in the boxplot, there still seem to be some outliers in the fixed acidity values. Looking at the means and the middle quantiles, the correlation with wine quality seems quite low, non-linear and even non-monotonic. Across qualities 4-7 the fixed.acidity decreases, but at quality 8 it rises again.
Let’s zoom in a bit closer to see the middle quantiles better.
The zoomed boxplot supports the same conclusions as the non-zoomed one.
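What the boxplots show can also be checked numerically by computing the median within each quality level. The sketch below uses made-up values that fall from quality 4 to 7 and rise again at 8; it is not the real data.

```r
# Assumed toy data: medians fall from quality 4 to 7, then rise at 8,
# mimicking the fixed.acidity pattern described in the text.
toy <- data.frame(
  quality       = rep(4:8, each = 3),
  fixed.acidity = c(7.0, 7.1, 7.2,  6.8, 6.9, 7.0,  6.7, 6.8, 6.9,
                    6.6, 6.7, 6.8,  6.8, 6.9, 7.0)
)

# Median of the variable within each quality level.
medians <- tapply(toy$fixed.acidity, toy$quality, median)
medians

# In a monotonic relationship, all step-to-step changes share one sign.
steps <- diff(as.numeric(medians))
is_monotonic <- all(steps >= 0) || all(steps <= 0)
is_monotonic
```

Here the last step is positive while the earlier ones are negative, so the check reports a non-monotonic pattern.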
Because the relationship looks clearly non-linear and non-monotonic, the correlation tests will likely give a misleading measure of the association between the variables.
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and wines_subset$fixed.acidity
## S = 1.7907e+10, p-value = 7.214e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.07900588
The correlation looks very low, but the true association might be stronger given the non-monotonic nature of the relationship.
For volatile acidity, the variance of values between different qualities seems higher. The relationship seems non-monotonic and non-linear here as well.
In the zoomed plot, the volatile acidity seems to drop at first between qualities 4-6 and then stay at somewhat similar levels.
Let’s run a Spearman correlation test to quantify this correlation. Spearman is used because the relationship seems to be non-linear.
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and wines_subset$volatile.acidity
## S = 1.9539e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1773517
The variables seem to have slight negative correlation as the boxplots suggest.
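A call that produces output of this shape is sketched below on simulated data. The `as.numeric(...)` conversion mirrors the `data:` line in the output; `exact = FALSE` is an addition here, used because quality contains many ties and an exact p-value cannot be computed with ties.

```r
set.seed(42)
# Assumed toy data: volatile acidity loosely decreasing with quality.
quality_factor   <- factor(sample(4:8, 200, replace = TRUE), ordered = TRUE)
volatile.acidity <- 0.5 - 0.04 * as.numeric(as.character(quality_factor)) +
  rnorm(200, sd = 0.08)

# as.numeric() on a factor returns level codes (1, 2, ...), which is fine
# for a rank-based test because only the ordering matters.
res <- cor.test(as.numeric(quality_factor), volatile.acidity,
                method = "spearman", exact = FALSE)

res$estimate  # Spearman's rho
res$p.value
```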
Residual sugar seems to have fewer outliers in the boxplot than the previous variables. Also, the variance between qualities seems nonexistent in the first quartile of the data and then rises significantly in the following quartiles. This matches the pronounced long-tailed pattern seen in the distribution of residual.sugar.
Around the mean, the relationship between quality and residual.sugar follows a clearly non-monotonic pattern. Low-quality wines seem to be less sweet than average wines, and the sweetness starts to drop again once we move to higher-quality wines (although quality-8 wines seem slightly sweeter than quality-7 ones).
Besides the non-monotonic pattern, there seems to be significant variation in residual.sugar values across quality levels. Because of the non-monotonic nature of the relationship, a correlation test may give slightly misleading results, and hence I will trust the visualization on the nature of the relationship.
Let’s do the correlation test anyway (using Spearman again):
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and wines_subset$residual.sugar
## S = 1.7922e+10, p-value = 5.065e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.07993064
The correlation value seems really low, which I feel does not tell the whole story on the relationship between residual.sugar and quality.
There seem to be a few rather large outliers here, and once again a non-linear and non-monotonic relationship can be seen in the data. Total sulfur dioxide starts lower for low-quality wines, rises a bit for average-quality wines, and then decreases as quality increases, reaching its minimum at quality 7 and remaining at a similar level at quality 8.
Let’s look at the Spearman correlation:
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and wines_subset$total.sulfur.dioxide
## S = 1.9774e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1915279
Despite the non-monotonic relationship of the variables, the correlation test shows some slight negative correlation. If we ignore the rising trend between qualities 4 and 5, the relationship does look quite clearly like a monotonic, non-linear, negative correlation.
Looking at the plot, there seem to be quite a few values outside the middle quartiles. The middle quartiles, though, seem to have quite low variance, especially at higher qualities. The relationship with quality also seems fairly monotonic.
Zooming into the data, we can definitely see some non-linear positive correlation here.
Let’s look at the correlation with a Spearman test:
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and wines_subset$citric.acid
## S = 1.6225e+10, p-value = 0.1288
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.02231445
The test indicates that the positive correlation is very low and, with a p-value of 0.1288, not statistically significant. This could, in part, be explained by the outliers in the dataset and the high variance of citric.acid in the outer quartiles.
The relationship, once again, seems non-monotonic, but there is definite variation between different wine qualities, which leads us to believe that there is some sort of relationship between density and quality.
Let’s try a Spearman test:
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and wines_subset$density
## S = 2.2324e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.3451655
Despite the non-monotonicity, there seems to be a definite correlation between the variables according to the Spearman test.
Once again, there seems to be a clear non-monotonic relationship between the variables. Interestingly enough, the pattern looks very similar to that of the other independent variables. The correlation seems positive here, though.
Let’s try a Spearman test:
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and wines_subset$alcohol
## S = 9258400000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4421265
Despite the non-monotonicity, the positive correlation seems to clearly be there.
Next, let’s try to find interesting relationships between some of the independent variables. Relationships of interest might include alcohol vs density, residual.sugar vs alcohol, citric.acid vs volatile.acidity, and citric.acid vs residual.sugar.
There seems to be some negative correlation between alcohol and density. As the density-level rises, the amount of alcohol reduces.
Let’s see if any patterns emerge if we color the plot by quality.
There seems to be a slight pattern in which high-quality wines have both a high amount of alcohol and a low density. This agrees with what we found when looking at the boxplots and correlations of alcohol vs quality and density vs quality.
Let’s see how well the ratio of alcohol to density correlates with quality:
The pattern looks very similar to that of alcohol. This is due to the fact that density has very little variation between different qualities when compared to alcohol.
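Why the ratio behaves almost exactly like alcohol can be seen in a small simulation. The ranges below are taken from the summaries earlier in the report; drawing the two variables independently and uniformly is an assumption made only for this sketch.

```r
set.seed(7)
# Assumed toy data using the value ranges reported in the summaries.
alcohol <- runif(500, min = 8.4, max = 14.2)
density <- runif(500, min = 0.9871, max = 1.002)

ratio <- alcohol / density

# Density varies by about 1.5% while alcohol varies by tens of percent,
# so the ratio is almost a rescaled copy of alcohol.
rho <- cor(alcohol, ratio, method = "spearman")
rho
```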
It is difficult to see any patterns here, so let’s add some alpha to the plot:
It is difficult to find any relationships in this chart. Most of the values seem to be on the left side of the chart, because a significant number of wines had a low sugar amount (as seen in the long-tailed pattern of the residual.sugar histogram). Additionally, it seems that sweeter wines have a lower alcohol amount, which is curious.
Let’s see if any patterns emerge when coloring the scatterplot by quality.
When looking at the plot, there seems to be a slight pattern of qualities being higher at the high-alcohol, low-sugar end of the plot. This, however, might not indicate a meaningful relationship; instead, the combination of the positive correlation of alcohol vs quality and the dataset having mostly low-sugar wines could explain this phenomenon.
Let’s look at the ratio of sugar and alcohol vs quality in a boxplot.
There does seem to be some variance between the ratio-values at different qualities. The relationship is clearly nonlinear and also non-monotonic.
Let’s test the correlation using Spearman:
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and (wines_subset$alcohol/wines_subset$residual.sugar)
## S = 1.4476e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1277507
There seems to be a small positive correlation. The coefficient is unreliable as a summary, however, because Spearman’s rho only measures monotonic association and the relationship here is clearly non-monotonic.
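The limitation is easy to demonstrate with a constructed example: a variable that depends on quality exactly, but through a symmetric inverted-U shape. The data are made up for illustration.

```r
# Assumed toy data: a perfect inverted-U relationship across five levels.
quality <- rep(4:8, each = 20)
ratio   <- -(quality - 6)^2

# The dependence is exact, yet the rising and falling halves cancel out
# once both variables are converted to ranks.
rho <- cor(quality, ratio, method = "spearman")
rho
```

The coefficient comes out as zero (up to floating-point error) even though `ratio` is fully determined by `quality`, which is why a low rho for a non-monotonic relationship says little about how strong the relationship is.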
According to the plot, there don’t really seem to be any meaningful relationships between the variables.
Let’s see if anything interesting comes up when coloring the plot by quality:
ggplot(data = wines_subset,
       aes(x = citric.acid, y = volatile.acidity, color = quality_factor)) +
  geom_point()
Now this might be interesting. In the middle of the cluster, there seems to be a high concentration of higher quality wines. Could there be a relationship here?
Let’s see if the ratio of citric.acid to volatile.acidity correlates with quality.
A slight positive correlation can be seen in the boxplot.
Let’s do a Spearman test:
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and (wines_subset$citric.acid/wines_subset$volatile.acidity)
## S = 1.4628e+10, p-value = 5.604e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1185538
The test agrees with what the boxplot already told us: there does seem to be a small positive correlation.
This plot is also difficult to follow, as the residual sugar values are so heavily skewed towards low amounts of sugar. No clear patterns emerge from the plot, except perhaps that low citric.acid seems to occur only in wines with low amounts of sugar. But that may also be because the vast majority of wines are low in sugar.
Let’s see if we can find anything by coloring the plot by quality.
There don’t seem to be any visible patterns in the scatterplot.
Let’s see if the ratio of citric.acid to residual.sugar correlates with quality.
The patterns here seem very small. Let’s zoom the plot a bit to see more.
The correlation is tiny, but it seems to be there. The correlation looks nonlinear and non-monotonic, forming a kind of a wave-pattern in its relationship with quality.
Let’s do a Spearman test:
##
## Spearman's rank correlation rho
##
## data: as.numeric(wines_subset$quality) and (wines_subset$citric.acid/wines_subset$residual.sugar)
## S = 1.4804e+10, p-value = 1.682e-13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.107994
There seems to be a small positive correlation between the ratio of citric acid to residual sugar and quality. The correlation test is, however, unreliable, as the relationship is clearly non-monotonic.
Let’s see what the correlation values for fixed.acidity, volatile.acidity, residual.sugar, total.sulfur.dioxide, citric.acid, density and alcohol were:
Note that the chart uses absolute values, as we don’t really care about the direction of the correlation - only its size. Alcohol and alcohol/density look very similar in size - this is because density varies so little that the ratio is almost a rescaled copy of alcohol, as explained earlier.
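The ranking behind such a chart can be reproduced from the rho values printed in the Spearman outputs earlier in the report. The sketch below only sorts the single-variable values; how the original chart was drawn is not shown in the report, so no plotting call is assumed here.

```r
# Spearman rho values copied from the test outputs in this report.
rho <- c(
  fixed.acidity        = -0.07900588,
  volatile.acidity     = -0.1773517,
  residual.sugar       = -0.07993064,
  total.sulfur.dioxide = -0.1915279,
  citric.acid          =  0.02231445,
  density              = -0.3451655,
  alcohol              =  0.4421265
)

# Rank the features by the size of the correlation, ignoring its sign.
ranked <- sort(abs(rho), decreasing = TRUE)
ranked

# Keep the direction alongside the magnitude, e.g. for coloring bars.
direction <- ifelse(rho[names(ranked)] < 0, "negative", "positive")
direction
```

The `ranked` vector could then be passed to `barplot()` or to a ggplot2 bar geometry.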
Let’s try to build a linear model for wine quality using the independent variables with the largest perceived correlation.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wines_subset)
## m2: lm(formula = I(quality) ~ I(alcohol) + alcohol + alcohol:density,
## data = wines_subset)
## m3: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + alcohol:density,
## data = wines_subset)
## m4: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## alcohol:density, data = wines_subset)
## m5: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + alcohol:density, data = wines_subset)
##
## ==============================================================================
##                             m1          m2          m3          m4          m5
## ------------------------------------------------------------------------------
## (Intercept)           2.628***    1.930*** -239.231*** -236.496*** -305.231***
##                        (0.098)     (0.177)    (35.021)    (35.121)    (34.260)
## I(alcohol)            0.310***   -2.796***   20.767***   20.609***   26.209***
##                        (0.009)     (0.656)     (3.483)     (3.487)     (3.394)
## alcohol x density                 3.193***  -20.544***  -20.386***  -25.994***
##                                    (0.674)     (3.512)     (3.515)     (3.421)
## density                                     242.881***  240.109***  309.650***
##                                               (35.270)    (35.373)    (34.507)
## citric.acid                                                  0.106     -0.229*
##                                                            (0.103)     (0.101)
## volatile.acidity                                                     -2.076***
##                                                                        (0.119)
## ------------------------------------------------------------------------------
## R-squared                0.195       0.199       0.207       0.208       0.256
## adj. R-squared           0.195       0.199       0.207       0.207       0.255
## sigma                    0.773       0.771       0.768       0.768       0.744
## F                     1124.586     576.114     403.732     303.068     318.724
## p                        0.000       0.000       0.000       0.000       0.000
## Log-likelihood       -5384.008   -5372.809   -5349.199   -5348.667   -5202.000
## Deviance              2770.300    2756.945    2729.001    2728.375    2561.055
## AIC                  10774.016   10753.618   10708.398   10709.334   10417.999
## BIC                  10793.340   10779.383   10740.604   10747.983   10463.089
## N                         4635        4635        4635        4635        4635
## ==============================================================================
It seems that adding the alcohol:density interaction, density, and citric.acid improved the model very little: R-squared only rises from 0.195 (m1) to 0.208 (m4).
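The step-by-step model building can be sketched with base R alone, assuming simulated data in place of the real dataset (the coefficients used to generate it are arbitrary). The side-by-side table in the report comes from a table-formatting helper that is not shown, so this sketch collects the comparison figures by hand.

```r
set.seed(3)
# Assumed toy data; the generating coefficients are arbitrary.
n <- 300
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.2),
  volatile.acidity = runif(n, 0.08, 0.70)
)
toy$quality <- 2.6 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, sd = 0.75)

# Start from a single predictor and grow the model with update().
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)

# Compare explained variance and information criteria across the fits.
comparison <- data.frame(
  model         = c("m1", "m2"),
  adj.r.squared = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC           = c(AIC(m1), AIC(m2))
)
comparison
```

A predictor is worth keeping when it raises adjusted R-squared and lowers AIC by more than a trivial amount, which is the comparison made between the columns of the table.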
Let’s add the rest of the features:
##
## Calls:
## m6: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + alcohol:density + alcohol:residual.sugar,
## data = wines_subset)
## m7: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + alcohol:density + alcohol:residual.sugar +
## citric.acid:volatile.acidity, data = wines_subset)
## m8: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + alcohol:density + alcohol:residual.sugar +
## citric.acid:volatile.acidity + citric.acid:residual.sugar,
## data = wines_subset)
## m9: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + total.sulfur.dioxide + alcohol:density +
## alcohol:residual.sugar + citric.acid:volatile.acidity + citric.acid:residual.sugar,
## data = wines_subset)
## m10: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + total.sulfur.dioxide + fixed.acidity +
## alcohol:density + alcohol:residual.sugar + citric.acid:volatile.acidity +
## citric.acid:residual.sugar, data = wines_subset)
## m11: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + total.sulfur.dioxide + fixed.acidity +
## residual.sugar + alcohol:density + alcohol:residual.sugar +
## citric.acid:volatile.acidity + citric.acid:residual.sugar,
## data = wines_subset)
## m12: lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + total.sulfur.dioxide + fixed.acidity +
## residual.sugar + sulphates + alcohol:density + alcohol:residual.sugar +
## citric.acid:volatile.acidity + citric.acid:residual.sugar,
## data = wines_subset)
##
## ===================================================================================================================
##                                          m6          m7          m8          m9         m10         m11         m12
## -------------------------------------------------------------------------------------------------------------------
## (Intercept)                     -167.504*** -166.448*** -180.885*** -160.624*** -165.984*** -240.560*** -228.909***
##                                    (37.333)    (37.330)    (39.480)    (39.741)    (39.890)    (66.044)    (65.867)
## I(alcohol)                        22.809***   22.680***   24.097***   23.449***   22.791***   29.574***   31.372***
##                                     (3.387)     (3.387)     (3.615)     (3.612)     (3.638)     (6.013)     (6.002)
## density                          172.154***  171.235***  185.782***  165.234***  170.649***  246.063***  234.368***
##                                    (37.546)    (37.542)    (39.712)    (39.979)    (40.131)    (66.661)    (66.483)
## citric.acid                          -0.082      -0.540      -0.386      -0.329      -0.291      -0.312      -0.318
##                                     (0.102)     (0.283)     (0.314)     (0.314)     (0.315)     (0.315)     (0.314)
## volatile.acidity                  -2.093***   -2.542***   -2.526***   -2.496***   -2.501***   -2.541***   -2.519***
##                                     (0.118)     (0.283)     (0.284)     (0.283)     (0.283)     (0.285)     (0.284)
## alcohol x density                -22.706***  -22.575***  -24.009***  -23.360***  -22.681***  -29.540***  -31.397***
##                                     (3.413)     (3.413)     (3.643)     (3.641)     (3.668)     (6.073)     (6.063)
## alcohol x residual.sugar           0.005***    0.005***    0.005***    0.005***    0.005***     0.010**     0.012**
##                                     (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.004)     (0.004)
## citric.acid x volatile.acidity                    1.486       1.491       1.193       1.222       1.229       1.250
##                                                 (0.854)     (0.854)     (0.855)     (0.856)     (0.855)     (0.853)
## citric.acid x residual.sugar                                 -0.021      -0.020      -0.022      -0.018      -0.023
##                                                             (0.019)     (0.019)     (0.019)     (0.019)     (0.019)
## total.sulfur.dioxide                                                   0.001***    0.001***    0.001***    0.001***
##                                                                         (0.000)     (0.000)     (0.000)     (0.000)
## fixed.acidity                                                                        -0.025      -0.026      -0.006
##                                                                                     (0.016)     (0.016)     (0.017)
## residual.sugar                                                                                   -0.056      -0.062
##                                                                                                 (0.040)     (0.039)
## sulphates                                                                                                  0.594***
##                                                                                                             (0.107)
## -------------------------------------------------------------------------------------------------------------------
## R-squared                             0.269       0.269       0.269       0.272       0.272       0.272       0.277
## adj. R-squared                        0.268       0.268       0.268       0.270       0.271       0.271       0.275
## sigma                                 0.738       0.737       0.737       0.736       0.736       0.736       0.734
## F                                   283.287     243.357     213.107     191.819     172.919     157.416     147.771
## p                                     0.000       0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood                    -5162.683   -5161.165   -5160.533   -5152.535   -5151.371   -5150.365   -5135.072
## Deviance                           2517.972    2516.323    2515.637    2506.971    2505.712    2504.624    2488.152
## AIC                               10341.365   10340.329   10341.065   10327.070   10326.741   10326.730   10298.145
## BIC                               10392.896   10398.302   10405.479   10397.926   10404.038   10410.468   10388.324
## N                                      4635        4635        4635        4635        4635        4635        4635
## ===================================================================================================================
Most of the features affect the model very little. The ratio features add little to explaining the variance of quality. It seems that adding sulphates improves the model a little.
Overall, the proposed model explains 27.7% of the variation in quality, which is quite poor.
Let’s see what went wrong:
##
## Call:
## lm(formula = I(quality) ~ I(alcohol) + alcohol + density + citric.acid +
## volatile.acidity + total.sulfur.dioxide + fixed.acidity +
## residual.sugar + sulphates + alcohol:density + alcohol:residual.sugar +
## citric.acid:volatile.acidity + citric.acid:residual.sugar,
## data = wines_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.94009 -0.53649 -0.02716 0.42322 2.57985
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.289e+02 6.587e+01 -3.475 0.000515 ***
## I(alcohol) 3.137e+01 6.002e+00 5.226 1.80e-07 ***
## alcohol NA NA NA NA
## density 2.344e+02 6.648e+01 3.525 0.000427 ***
## citric.acid -3.175e-01 3.144e-01 -1.010 0.312579
## volatile.acidity -2.519e+00 2.839e-01 -8.873 < 2e-16 ***
## total.sulfur.dioxide 1.071e-03 3.234e-04 3.311 0.000936 ***
## fixed.acidity -6.439e-03 1.661e-02 -0.388 0.698281
## residual.sugar -6.191e-02 3.950e-02 -1.567 0.117106
## sulphates 5.943e-01 1.074e-01 5.532 3.35e-08 ***
## alcohol:density -3.140e+01 6.063e+00 -5.178 2.34e-07 ***
## alcohol:residual.sugar 1.207e-02 3.746e-03 3.222 0.001282 **
## citric.acid:volatile.acidity 1.250e+00 8.527e-01 1.466 0.142738
## citric.acid:residual.sugar -2.258e-02 1.910e-02 -1.182 0.237269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7337 on 4622 degrees of freedom
## Multiple R-squared: 0.2773, Adjusted R-squared: 0.2754
## F-statistic: 147.8 on 12 and 4622 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: I(quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## I(alcohol) 1 672.45 672.45 1249.1376 < 2.2e-16 ***
## density 1 21.13 21.13 39.2511 4.065e-10 ***
## citric.acid 1 0.97 0.97 1.8060 0.1790508
## volatile.acidity 1 155.20 155.20 288.3006 < 2.2e-16 ***
## total.sulfur.dioxide 1 6.35 6.35 11.8021 0.0005969 ***
## fixed.acidity 1 23.68 23.68 43.9936 3.673e-11 ***
## residual.sugar 1 42.14 42.14 78.2839 < 2.2e-16 ***
## sulphates 1 15.47 15.47 28.7366 8.696e-08 ***
## alcohol:density 1 9.82 9.82 18.2509 1.975e-05 ***
## alcohol:residual.sugar 1 5.48 5.48 10.1731 0.0014346 **
## citric.acid:volatile.acidity 1 1.14 1.14 2.1249 0.1449932
## citric.acid:residual.sugar 1 0.75 0.75 1.3971 0.2372692
## Residuals 4622 2488.15 0.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values (Pr(>|t|)) are quite high for citric.acid, citric.acid:volatile.acidity, and citric.acid:residual.sugar, which suggests that they do not have a significant effect on wine quality. These are all the features that use citric.acid, so I conclude that citric.acid does not have a detectable effect on quality in this model.
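The summary also reports one coefficient as "not defined because of singularities", with `NA` in the `alcohol` row. This happens because the formula contains both `I(alcohol)` and `alcohol`, which are the same column. A small sketch on simulated data (not the real dataset) reproduces the effect:

```r
set.seed(5)
# Assumed toy data.
toy <- data.frame(alcohol = runif(100, 8.4, 14.2))
toy$quality <- 2.6 + 0.3 * toy$alcohol + rnorm(100, sd = 0.75)

# I(alcohol) and alcohol are identical columns, so the design matrix is
# rank-deficient and lm() cannot estimate both coefficients.
fit <- lm(quality ~ I(alcohol) + alcohol, data = toy)
coef(fit)  # the second, redundant term comes back as NA

# Dropping the duplicate leaves the fitted values unchanged.
fit_clean <- lm(quality ~ alcohol, data = toy)
same_fit <- all.equal(unname(fitted(fit)), unname(fitted(fit_clean)))
same_fit
```

The redundant term is harmless for the fit itself, but removing it makes the coefficient table easier to read.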
Let’s remove the citric.acid features and create the final linear model:
##
## Call:
## lm(formula = I(quality) ~ I(alcohol) + density + volatile.acidity +
## alcohol + total.sulfur.dioxide + fixed.acidity + residual.sugar +
## sulphates + alcohol:residual.sugar + density:alcohol, data = wines_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9512 -0.5355 -0.0213 0.4228 2.5903
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.194e+02 6.556e+01 -3.346 0.000827 ***
## I(alcohol) 3.052e+01 5.968e+00 5.114 3.27e-07 ***
## density 2.247e+02 6.618e+01 3.395 0.000692 ***
## volatile.acidity -2.146e+00 1.195e-01 -17.961 < 2e-16 ***
## alcohol NA NA NA NA
## total.sulfur.dioxide 1.089e-03 3.205e-04 3.396 0.000689 ***
## fixed.acidity -9.185e-03 1.610e-02 -0.570 0.568403
## residual.sugar -6.891e-02 3.915e-02 -1.760 0.078484 .
## sulphates 5.831e-01 1.072e-01 5.439 5.63e-08 ***
## alcohol:residual.sugar 1.208e-02 3.741e-03 3.230 0.001247 **
## density:alcohol -3.054e+01 6.028e+00 -5.066 4.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7338 on 4625 degrees of freedom
## Multiple R-squared: 0.2766, Adjusted R-squared: 0.2752
## F-statistic: 196.5 on 9 and 4625 DF, p-value: < 2.2e-16
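One way to check whether dropping a group of terms costs anything is an F-test between the nested models. This is not something the report ran; the sketch below is an optional extra check on simulated data in which citric.acid has no effect by construction.

```r
set.seed(11)
n <- 400
# Assumed toy data in which citric.acid does not influence quality.
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.2),
  volatile.acidity = runif(n, 0.08, 0.70),
  citric.acid      = runif(n, 0, 0.75)
)
toy$quality <- 2.6 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, sd = 0.75)

full    <- lm(quality ~ alcohol + volatile.acidity + citric.acid +
                citric.acid:volatile.acidity, data = toy)
reduced <- lm(quality ~ alcohol + volatile.acidity, data = toy)

# Joint F-test of the two dropped terms; a large p-value would mean the
# data give no evidence that they improve the fit.
comparison <- anova(reduced, full)
comparison
```

For the models in this report, the near-identical R-squared values (0.2773 with the citric.acid terms, 0.2766 without) point the same way.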
In this research I explored what chemical qualities could affect the quality of white wine.
It seemed that the most significant factor that affected white wine quality in this dataset was quite surprisingly alcohol. Here is a boxplot summarizing the relationship of the amount of alcohol and quality.
When exploring multivariate relationships, a curious relationship was found between citric.acid, volatile.acidity and quality.
In this plot, the good-quality wines seemed to cluster around a certain area. Even so, the regression analysis found no significant effect of citric.acid on wine quality.
Finally, the correlations of each explored feature are displayed in this bar plot:
In the plot, the bars are colored by the correlation direction, and the plot is sorted by the correlation size. It is evident in the plot that alcohol seems to be the most important factor in wine quality.
In this research I explored what chemical qualities could affect the quality of white wine.
First, the features were plotted as histograms to detect outliers and to discover some interesting patterns in their distributions.
Then, each feature’s correlation with quality was investigated using ggpairs. After that, a subset of the most promising independent variables was chosen for closer examination.
The chosen independent variables were plotted and tested for correlation with quality. Some of the independent variables were plotted against each other to discover interesting patterns and to perhaps create additional features from the ratios of these independent variables.
Finally, a linear model was built to see how well the chemical properties could explain wine quality using a simple model. Turns out, not too well.
The features provided seem ill-suited to explain wine quality. This is understandable, however, as human opinions and decisions (such as how people rate wine) are notoriously difficult to explain with a handful of variables. The relationships of the variables were very non-linear and non-monotonic, which leads us to believe that different chemical properties suit different wines in ways that are difficult to estimate, at least using just the features given in this dataset.
The dataset was also quite limited in different ways, for example almost all of the wines in the dataset had qualities of 5-7. It would have been interesting if the dataset had contained more data on very high and very low quality wines in addition to some more features that could explain the variance in quality a bit better.
Despite the shortcomings of the linear model, the exploratory analysis provided some interesting insight into how the chemical properties relate to wine quality in this dataset. For example, alcohol seemed to be the most significant variable in defining wine quality. This hopefully doesn’t mean that the wine experts mostly look for alcohol in their wine; more probably it reflects the fact that more mature wine often has a higher alcohol content, and more mature wine is often regarded as higher quality. It is good not to confuse correlation with causation when interpreting these results.